Polling: An introduction

A lot of people here read polls.  I’m a polling addict, myself.  

But a lot of what people think about polls is, well… uninformed.  I’m a statistician.

Before we jump below the fold: this is not going to be about any particular poll, or any particular race, or any particular anything.  It’s general.

crossposted to Docudharma and Daily Kos

A poll is a type of survey, designed to estimate how people will vote.  That ‘estimate’ is key.  Even a perfect poll is not going to be exactly right.  There are two ways it can go wrong: It can be biased or it can be *inaccurate*.  Now, both of those are English words, but statisticians use them in a particular way, not exactly like ordinary usage.

Bias means that it is systematically wrong.

Inaccurate means that it is unsystematically wrong.  
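If you like to see these two kinds of "wrong" rather than read about them, here's a toy simulation (my own sketch in Python, not part of the original post; the 5-point skew is an invented example of a bad sampling method):

```python
import random
import statistics

random.seed(1)

TRUE_SUPPORT = 0.50  # the truth: half the electorate prefers candidate A

def poll(n, bias=0.0):
    """Simulate asking n voters; `bias` shifts each voter's chance
    of answering 'A' (i.e., a systematically skewed sample)."""
    p = TRUE_SUPPORT + bias
    return sum(random.random() < p for _ in range(n)) / n

# Inaccurate but unbiased: small samples scatter widely,
# but they scatter *around the truth*.
noisy = [poll(100) for _ in range(1000)]

# Biased: big samples, tight spread, but centered on the wrong number
# (here the sampling method over-represents A supporters by 5 points).
biased = [poll(1000, bias=0.05) for _ in range(1000)]

print(round(statistics.mean(noisy), 2))    # close to the true 0.50
print(round(statistics.mean(biased), 2))   # close to 0.55: systematically wrong
print(round(statistics.stdev(noisy), 3))   # wide spread: inaccurate
print(round(statistics.stdev(biased), 3))  # narrow spread, but biased
```

The biased polls look "better" (tighter) poll to poll, which is exactly why bias is the more dangerous failure: no amount of extra sample size fixes it.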

Bias need not be deliberate, although it can be.  The ultimate in deliberate bias is the ‘push poll’: “Recent reports indicate that Joe Nogoodnik may have been indicted for rape in the past.  Are you voting for Joe Nogoodnik, or his opponent, Sue Baddata?”  Somewhat more subtle (and less predictable) is the sort of question I was once asked: “This is Bella Abzug for Mayor headquarters.  Are you voting for Bella Abzug, or her opponent, Ed Koch?”  That’s not a push poll, but it isn’t a very good one.

But there are much more subtle biases.  People answer differently when asked about Hillary Rodham Clinton vs. Hillary Clinton (the former is less popular, go figure).  In a fascinating result, it was found that people answer questions about racism differently if the person asking the question has a southern accent.  Could that affect results about, say, Obama? Sure.

I’d bet more people say they’ll vote for Clinton if the person asking is female.

People are more likely to prefer the first candidate in a list.  Good pollsters rotate order.

Another way polls can be biased is in choosing a sample badly.  The most famous case of this is the Literary Digest poll that showed Alf Landon beating FDR in a landslide.  Oops.  FDR won.  Landon got only VT and ME.  The poll surveyed 10 million people, and they got 2 million replies.  What went wrong?  Bias.  The survey went to Literary Digest subscribers (most of whom were fairly wealthy), car owners (even now, not a proportional sample, and, in 1936, not remotely close), and telephone users (again, not a good sample now, and much worse in 1936).

These days, no pollster makes quite that big an error.  But many come close.  All the internet polls are based on people who *volunteer*, one way or another, to be surveyed.  These are not a random sample.  And, while there are ways to correct for some biases (e.g. if your sample is more male than the population) there is no way to correct for this sort of thing.

Then there’s accuracy.  Polls with larger samples are more accurate.  The Literary Digest poll was very, very accurate.  It was just accurate about the wrong thing.  As the famous statistician John Tukey once said:

Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.

Poll results are typically reported with a margin of error.  This is widely misinterpreted.  It is an estimate of how far the reported number is likely to stray from the truth: a range that would capture the true value in most (usually 95%) of repeated polls.  But what we’re really interested in runs the other way: if the true result is such-and-such, how likely are results like these?  Now, let’s do some simulating.  Suppose that the TRUTH is that 50% of everyone who will vote prefer Joe Shmo, and 50% prefer Jon Noone.
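For a proportion near 50%, the standard 95% margin of error works out to about 1.96 × √(p(1−p)/n).  A quick sketch (my own, in Python) of what that gives for the sample sizes we'll be playing with:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an estimated proportion p from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

# At p = 0.5 (the worst case), in percentage points:
for n in (100, 500, 1000):
    print(n, round(100 * margin_of_error(0.5, n), 1))
# 100 respondents -> about +/-9.8 points
# 500 respondents -> about +/-4.4 points
# 1000 respondents -> about +/-3.1 points
```

That ±3.1 for n = 1000 is where the familiar "±3 points" in news reports comes from.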

What will happen if we ask 100 people, properly chosen?  Let’s do it.  The first few times, I got responses of

42, 55, 43, 53, 51, 52, 52, 56, 49

Note that not one of them was exactly right!  They were off by as much as 8 points.

Now, what if we asked 1000 people each time?

52.1, 48.1, 51.1, 49.8, 46.0, 48.4, 50.8, 53.3, 49.8, 47.5

Notice that the numbers are closer to the right number.  Still, in one case we were off by 3.3 points, and in another by 4.
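If you want to play along at home, here's a sketch of that experiment in Python (the original numbers were presumably generated in a stats package; this is my own reconstruction, so the draws will differ):

```python
import random

random.seed(42)

def poll_pct(n, true_p=0.5):
    """Ask n randomly chosen voters; return the percent who say Joe Shmo."""
    return 100 * sum(random.random() < true_p for _ in range(n)) / n

small = [round(poll_pct(100)) for _ in range(10)]      # ten 100-person polls
large = [round(poll_pct(1000), 1) for _ in range(10)]  # ten 1000-person polls

print(small)
print(large)
# The 1000-person polls hug 50 much more tightly than the 100-person ones.
```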

That’s OK if there are two candidates, each at about 50%.  What about, say, the situation in IA on the Dem side, where it seems like we have 3 candidates (Barack, Hillary, John) at about .33 each (we’ll ignore the remaining candidates)?  I did this 10 times with a sample of 100 each.  Proportions ranged from .24 to .41.  In other words, if there were ten polls done, each with 100 people, results might look like this:



      1    2    3
 1  0.34 0.33 0.33
 2  0.27 0.34 0.39
 3  0.35 0.30 0.35
 4  0.25 0.34 0.41
 5  0.36 0.32 0.32
 6  0.32 0.37 0.31
 7  0.31 0.28 0.41
 8  0.32 0.36 0.32
 9  0.36 0.35 0.29
10  0.24 0.36 0.40

where the rows are polls and the columns are candidates.
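The three-way version of the simulation is just as easy to sketch (again my own Python, not the original code, so the particular numbers will differ):

```python
import random

random.seed(7)

def three_way_poll(n, cuts=(1/3, 2/3)):
    """Poll n voters in a dead-even three-way race; return each share."""
    counts = [0, 0, 0]
    for _ in range(n):
        r = random.random()
        counts[0 if r < cuts[0] else 1 if r < cuts[1] else 2] += 1
    return [round(c / n, 2) for c in counts]

# Ten 100-person polls of a race that is truly tied at 1/3 each:
for _ in range(10):
    print(three_way_poll(100))
```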

What if each had 1,000 people?



      1     2     3
 1  0.307 0.319 0.374
 2  0.329 0.324 0.347
 3  0.327 0.321 0.352
 4  0.306 0.351 0.343
 5  0.317 0.354 0.329
 6  0.332 0.360 0.308
 7  0.314 0.336 0.350
 8  0.308 0.324 0.368
 9  0.343 0.312 0.345
10  0.334 0.312 0.354

Much better.

A typical poll has about 500 respondents, which looks like this:

      1     2     3
 1  0.346 0.316 0.338
 2  0.318 0.340 0.342
 3  0.328 0.334 0.338
 4  0.342 0.330 0.328
 5  0.318 0.344 0.338
 6  0.318 0.340 0.342
 7  0.302 0.322 0.376
 8  0.328 0.314 0.358
 9  0.372 0.332 0.296
10  0.350 0.302 0.348

So, let’s say column 1 is Clinton, column 2 is Edwards, column 3 is Obama (alphabetical).

Is Hillary leading John by 4 and Barack by 7? (row 9)

or

Is Barack leading John by 5 and Hillary by 7? (row 7)

or

Is it very very close (row 3)?

Remember, we’re going to see only the rows.

One way around this is to look at sites like Political Arithmetik and pollster.com, which collect lots of polls and graph them.  The former site is updated less often, but offers lots of insight.

And one way to exacerbate this difficulty (without lying outright) is to cite only polls that favor your candidate.  Bordering on lying, a candidate could sponsor five polls and release only the one that favors him (or her) most.  How would that affect things?

Let’s go back to the three-candidates-all-about-equal scenario.  Now, let’s say each candidate sponsors five polls, each with 500 respondents.

So, candidate 1 gets these results:

     1     2     3
1  0.310 0.326 0.364
2  0.294 0.338 0.368
3  0.370 0.308 0.322
4  0.292 0.348 0.360
5  0.322 0.312 0.366

and reports row 3.  He is leading by 5 points.

candidate 2 gets these results:

     1     2     3
1  0.302 0.342 0.356
2  0.298 0.336 0.366
3  0.332 0.356 0.312
4  0.296 0.344 0.360
5  0.330 0.320 0.350

and reports row 3… she is leading by 2 points.

Candidate 3 gets:

     1     2     3
1  0.322 0.332 0.346
2  0.322 0.332 0.346
3  0.368 0.322 0.310
4  0.310 0.326 0.364
5  0.374 0.286 0.340

and reports row 4.  He is leading by 4 points.

So, does that mean that polls are worthless? No.  It means they can be abused.
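The cherry-picking trick is easy to simulate, too.  Here's a sketch (mine, in Python; the helper names are my own invention):

```python
import random

random.seed(3)

def three_way_poll(n=500):
    """One poll of a dead-even three-way race; returns the three shares."""
    counts = [0, 0, 0]
    for _ in range(n):
        r = random.random()
        counts[0 if r < 1/3 else 1 if r < 2/3 else 2] += 1
    return [c / n for c in counts]

def best_released_poll(candidate, n_polls=5):
    """Sponsor n_polls polls; release only the one where `candidate`
    does best -- the cherry-picking trick from the text."""
    polls = [three_way_poll() for _ in range(n_polls)]
    return max(polls, key=lambda p: p[candidate])

# Even in a truly tied race, each sponsor can usually show a flattering poll:
for cand in range(3):
    print(cand, [round(x, 3) for x in best_released_poll(cand)])
```

Selecting the best of five polls systematically inflates the sponsor's number: it turns ordinary sampling noise into apparent support.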

Does it mean that results within the margin of error are all the same? No.  Because if the truth were that candidate 1 had 37%, candidate 2 had 29%, and candidate 3 had 34%, then results would look like this:

     1     2     3
1  0.404 0.292 0.304
2  0.378 0.246 0.376
3  0.392 0.284 0.324
4  0.378 0.268 0.354
5  0.364 0.332 0.304
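To see how different that scenario really is, here's one more sketch (my own Python): under a 37/29/34 truth, almost no 500-person poll will ever show the 29% candidate on top, even though some individual gaps sit inside the margin of error.

```python
import random

random.seed(11)

TRUTH = (0.37, 0.29, 0.34)  # the hypothetical true support from the text

def poll_500():
    """One 500-person poll under the 37/29/34 truth."""
    counts = [0, 0, 0]
    for _ in range(500):
        r = random.random()
        counts[0 if r < TRUTH[0] else 1 if r < TRUTH[0] + TRUTH[1] else 2] += 1
    return [c / 500 for c in counts]

polls = [poll_500() for _ in range(100)]
cand2_on_top = sum(p[1] == max(p) for p in polls)
print(cand2_on_top, "of 100 polls put candidate 2 ahead")
```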

And all that is just about single polls!  If people are interested, I can do another one where I simulate trends.